Using Random Forest to Learn Imbalanced Data
Abstract
In this paper we propose two ways to deal with the imbalanced data classification problem using random forest. One is based on cost-sensitive learning, and the other is based on a sampling technique. Performance metrics such as precision and recall, false positive rate and false negative rate, F-measure, and weighted accuracy are computed. Both methods are shown to improve the prediction accuracy of the minority class, and both compare favorably with existing algorithms.
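The metrics named in the abstract can all be derived from the four cells of a binary confusion matrix. The sketch below is illustrative only and is not taken from the paper: the names `ConfusionCounts`, `confusion_counts`, and `imbalance_metrics` are hypothetical, the minority class is assumed to be the positive class, the F-measure is assumed to weight precision and recall equally, and weighted accuracy is assumed to be `beta * TPR + (1 - beta) * TNR` with a hypothetical default `beta = 0.5`.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # minority (positive) cases predicted positive
    fn: int  # minority cases predicted negative
    fp: int  # majority (negative) cases predicted positive
    tn: int  # majority cases predicted negative


def confusion_counts(y_true, y_pred, positive=1):
    """Tally the confusion matrix, treating `positive` as the minority label."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return ConfusionCounts(tp, fn, fp, tn)


def _ratio(num, den):
    # Return 0.0 instead of raising when a denominator is empty.
    return num / den if den else 0.0


def imbalance_metrics(c, beta=0.5):
    """Compute the abstract's metrics; `beta` is an assumed class weight."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)   # true positive rate
    tnr = _ratio(c.tn, c.tn + c.fp)      # true negative rate
    return {
        "precision": precision,
        "recall": recall,
        "false_positive_rate": _ratio(c.fp, c.fp + c.tn),
        "false_negative_rate": _ratio(c.fn, c.fn + c.tp),
        "f_measure": _ratio(2 * precision * recall, precision + recall),
        "weighted_accuracy": beta * recall + (1 - beta) * tnr,
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fn + c.fp + c.tn),
    }


# Toy data (made up for illustration): 10 minority and 90 majority cases.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 6 + [0] * 4 + [1] * 9 + [0] * 81

for name, value in imbalance_metrics(confusion_counts(y_true, y_pred)).items():
    print(f"{name}: {value:.3f}")
```

On this toy data, plain accuracy looks high even though four of the ten minority cases are missed, which is why metrics that look at each class separately are the more informative ones for imbalanced problems.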
Similar resources
ForesTexter: An efficient random forest algorithm for imbalanced text categorization
In this paper, we propose a new Random Forest (RF) based ensemble method, ForesTexter, to solve imbalanced text categorization problems. RF has shown great success in many real-world applications. However, learning from text data with class imbalance is a relatively new challenge that needs to be addressed. An RF algorithm tends to use a simple random sampling of features in b...
An Improved Random Forest Algorithm for Class-Imbalanced Data Classification and its Application in PAD Risk Factors Analysis
Classification is one of the important research subjects in the field of machine learning. However, most machine learning algorithms train a classifier on the assumption that each class has roughly the same number of training examples. When a classifier is trained on imbalanced data, its performance declines markedly. For resolving the class-imbalanced problem...
On the use of MapReduce for imbalanced big data using Random Forest
Big data applications are increasingly the main focus of attention because of the enormous growth in data generation and storage in recent years. This becomes a challenge when huge amounts of data must be processed to extract knowledge, because data mining techniques are not adapted to the new space and time requirements. Furthermore, rea...
Random Forest Based Imbalanced Data Cleaning and Classification
The task of the PAKDD 2007 data mining competition is a typical problem of learning from an extremely imbalanced data set. In this paper, we propose a combination of random forest based techniques and sampling methods to identify the potential buyers. Our method is mainly composed of two phases: data cleaning and classification, both based on random forest. Firstly, the data set is cleaned by t...
Making Class Bias Useful: A Strategy of Learning from Imbalanced Data
The performance of many learning methods is usually influenced by the class imbalance problem, where the training data is dominated by instances belonging to one class. In this paper, we propose a novel method which combines random forest based techniques and sampling methods for effectively learning from imbalanced data. Our method is mainly composed of two phases: data cleaning and class...
Publication date: 2004